Problem 1: Use the color picker app from the
colorspace package
(colorspace::choose_color()) to create a qualitative color
scale containing five colors. One of the five colors should be
#5C9E76, so you need to find four additional colors that go
with this one.
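library(colorspace) # provides choose_color() and swatchplot()
# "#5C9E76" is the required color; the other four were picked to go with it
colors <- c("#5C9E76", "#E97451", "#3D6FFF", "#963DFF", "#FF3D6C")
swatchplot(colors)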
For the rest of this homework, we will be working with the
midwest_clean dataset, which is a cleaned-up version of the
ggplot2 midwest dataset.
library(tidyverse) # loads ggplot2 (midwest data), dplyr, and tidyr

midwest_clean <- midwest %>%
  select(
    state, county, area, popdensity, percbelowpoverty, inmetro
  ) %>% # keep only a subset of the columns
  na.omit() # remove any rows with missing data

head(midwest_clean)
## # A tibble: 6 × 6
##   state county     area popdensity percbelowpoverty inmetro
##   <chr> <chr>     <dbl>      <dbl>            <dbl>   <int>
## 1 IL    ADAMS     0.052      1271.            13.2        0
## 2 IL    ALEXANDER 0.014       759             32.2        0
## 3 IL    BOND      0.022       681.            12.1        0
## 4 IL    BOONE     0.017      1812.             7.21       1
## 5 IL    BROWN     0.018       324.            13.5        0
## 6 IL    BUREAU    0.05        714.            10.4        0
Problem 2: Perform a PCA of the
midwest_clean dataset and make a rotation plot of
components 1 and 2.
A summary of the PCA performed on the midwest_clean dataset is
shown below:
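library(broom) # provides tidy() and augment() for prcomp objects

midwest_pca <- midwest_clean %>% # rows with missing data were already removed
  select(where(is.numeric)) %>% # retain only numeric columns
  select(-inmetro) %>% # drop the 0/1 metro indicator (categorical)
  scale() %>% # scale to zero mean and unit variance
  prcomp() # perform the principal component analysis

# attach the PC scores (.fittedPC1, .fittedPC2, ...) to the original rows
aug_midwest <- midwest_pca %>%
  augment(midwest_clean)
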
summary(midwest_pca)
## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.0597 0.9750 0.9625
## Proportion of Variance 0.3743 0.3169 0.3088
## Cumulative Proportion  0.3743 0.6912 1.0000
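As a check on how to read this table, the "Proportion of Variance" row can be recomputed by hand: each component's variance is its standard deviation squared, divided by the total variance. The sketch below uses the rounded standard deviations printed above (the names sdev, prop_var, and cum_var are chosen here for illustration), so the results agree with the summary only up to rounding.

```r
# standard deviations copied from the summary output above (rounded)
sdev <- c(PC1 = 1.0597, PC2 = 0.9750, PC3 = 0.9625)

# variance explained by each component = sdev^2 / sum(sdev^2)
prop_var <- sdev^2 / sum(sdev^2)

# running total, matching the "Cumulative Proportion" row
cum_var <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)
```

Because the three variables were scaled to unit variance, the total variance is about 3 and no single component dominates: the first two components together explain roughly 69% of the variance.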
arrow_style <- arrow(
  angle = 20, length = grid::unit(6, "pt"),
  ends = "first", type = "closed"
)
midwest_pca %>%
  # extract rotation matrix
  tidy(matrix = "rotation") %>%
  pivot_wider(
    names_from = "PC", values_from = "value",
    names_prefix = "PC"
  ) %>%
  ggplot(aes(PC1, PC2, color = factor(column))) +
  geom_segment(
    xend = 0, yend = 0,
    arrow = arrow_style,
    linewidth = 1, alpha = 1 # linewidth replaces size for lines in ggplot2 >= 3.4.0
  ) +
  ggtitle("Rotation Matrix") +
  labs( # adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 1"
  ) +
  xlim(-0.65, 0.65) +
  ylim(-0.15, 1) +
  scale_color_manual(
    name = "Variables",
    values = colors, # custom palette created in Problem 1
    guide = guide_legend(nrow = 1)
  ) +
  theme_bw() + # adding a theme for visualization
  theme( # aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.background = element_blank()
  )
Problem 3: Make a scatter plot of PC 2 versus PC 1 and color by state. You should use the custom color scale you created in Problem 1. Then use the rotation plot from Problem 2 to describe where Chicago, Illinois can be found on the scatter plot. Provide any additional evidence used to support your answer.
ggplot(
  aug_midwest,
  aes(.fittedPC1, .fittedPC2, color = state)
) +
  geom_point(
    size = 3,
    alpha = 0.75,
    shape = 16
  ) +
  ggtitle("Midwest Principal Components") +
  labs( # adding labels
    x = "Principal Component 1",
    y = "Principal Component 2",
    subtitle = "Figure 2",
    caption = "Chicago, Illinois is in Cook County, shown and labeled in the top right corner of the scatter plot."
  ) +
  scale_color_manual(
    name = "State",
    values = colors # custom palette created in Problem 1
  ) +
  guides(
    color = guide_legend(override.aes = list(shape = 16, size = 3, alpha = 0.75))
  ) +
  # mark Cook County with a crossed circle drawn over its point
  geom_point(
    data = filter(aug_midwest, county == "COOK"),
    shape = 13,
    size = 5
  ) +
  # label Cook County; hjust and vjust are fixed settings, so they sit outside aes()
  geom_text(
    data = filter(aug_midwest, county == "COOK"),
    aes(label = "Cook County"),
    hjust = 1.13, vjust = 0.5,
    show.legend = FALSE,
    color = "black"
  ) +
  theme_bw() + # adding a theme for visualization
  theme( # aesthetics
    legend.position = "top",
    axis.line = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.background = element_blank()
  )
The rotation plot (Figure 1) suggests that principal component 1
measures how densely concentrated, or metropolitan, a county is:
popdensity loads positively on it, whereas area and
percbelowpoverty load negatively. All three variables
(popdensity, area, and percbelowpoverty) load positively on
principal component 2, with popdensity loading most strongly.
Principal component 2 could therefore be a measure of county
development.
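The signs and sizes of these relationships can also be read from the loadings themselves rather than from the arrows. The following is a minimal sketch, assuming the same three variables from ggplot2's midwest data and using prcomp()'s own scale. argument in place of a separate scale() call; the names vars and pca_check are hypothetical. The sign of a principal component is arbitrary, so an entire column may come out flipped relative to Figure 1.

```r
library(ggplot2) # for the midwest dataset

# the three numeric variables used in the PCA above
vars <- c("area", "popdensity", "percbelowpoverty")

# PCA on those variables, centered and scaled to unit variance
pca_check <- prcomp(na.omit(midwest[, vars]), scale. = TRUE)

# loadings of each variable on PC1 and PC2 (rows = variables)
round(pca_check$rotation[, c("PC1", "PC2")], 2)
```

Each row gives the coordinates of one arrow tip in the rotation plot, so a variable with a negative PC1 loading points to the left and one with a positive PC2 loading points upward.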
In the scatter plot (Figure 2), Cook County is labeled as the top-rightmost data point, meaning it scores very high on both principal component 1 and principal component 2. This is consistent with the interpretation drawn from Figure 1, since Cook County, which contains Chicago, is a densely populated metropolitan county.
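One way to gather additional evidence, sketched here under the assumption that the raw popdensity column is a fair proxy for how metropolitan a county is, is to sort the original data by population density and see which counties come out on top. The name densest is hypothetical.

```r
library(ggplot2) # for the midwest dataset
library(dplyr)

# the five most densely populated counties in the raw data
densest <- midwest %>%
  select(state, county, popdensity, inmetro) %>%
  arrange(desc(popdensity)) %>%
  head(5)

densest
```

If Cook County appears at or near the top of this list, that agrees with its position far along the popdensity arrow in Figure 2.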